RFM Analysis
Estimated reading time: 8 minutesRFM Analysis
This iPython notebook explains how to perform RFM analysis from customer purchase history data. The sample orders file is Sample - Superstore dataset from Tableau Software.
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
Read the sample orders file, containing all past purchases for all customers.
orders = pd.read_csv('sample-orders2.csv',sep=',')
orders.head()
order_date | order_id | customer | grand_total | |
---|---|---|---|---|
0 | 9/07/11 | CA-2011-100006 | Dennis Kane | 378 |
1 | 7/08/11 | CA-2011-100090 | Ed Braxton | 699 |
2 | 3/14/11 | CA-2011-100293 | Neil Französisch | 91 |
3 | 1/29/11 | CA-2011-100328 | Jasper Cacioppo | 4 |
4 | 4/08/11 | CA-2011-100363 | Jim Mitchum | 21 |
Create the RFM Table
Since recency is calculated for a point in time and the Tableau Super Store dataset last order date is Dec 31 2014, that is the date we will use to calculate recency.
Set this date to the current day and extract all orders until yesterday.
import datetime as dt
NOW = dt.datetime(2014,12,31)
# Make the date_placed column datetime
orders['order_date'] = pd.to_datetime(orders['order_date'])
Create the RFM Table
rfmTable = orders.groupby('customer').agg({'order_date': lambda x: (NOW - x.max()).days, # Recency
'order_id': lambda x: len(x), # Frequency
'grand_total': lambda x: x.sum()}) # Monetary Value
rfmTable['order_date'] = rfmTable['order_date'].astype(int)
rfmTable.rename(columns={'order_date': 'recency',
'order_id': 'frequency',
'grand_total': 'monetary_value'}, inplace=True)
Validating the RFM Table
rfmTable.head()
recency | frequency | monetary_value | |
---|---|---|---|
customer | |||
Aaron Bergman | 415 | 3 | 887 |
Aaron Hawkins | 12 | 7 | 1744 |
Aaron Smayling | 88 | 7 | 3050 |
Adam Bellavance | 54 | 8 | 7756 |
Adam Hart | 34 | 10 | 3249 |
Customer Aaron Bergman has frequency:3, monetary value:$887 and recency:415 days.
aaron = orders[orders['customer']=='Aaron Bergman']
aaron
order_date | order_id | customer | grand_total | |
---|---|---|---|---|
624 | 2011-02-19 | CA-2011-152905 | Aaron Bergman | 13 |
665 | 2011-03-07 | CA-2011-156587 | Aaron Bergman | 310 |
2336 | 2013-11-11 | CA-2013-140935 | Aaron Bergman | 564 |
Inserting the date of Aaron purchase and comparing it to the recency in the rfmTable we verify our RFM table is correct.
(NOW - dt.datetime(2013,11,11)).days==415
True
Determining RFM Quartiles
quantiles = rfmTable.quantile(q=[0.25,0.5,0.75])
#25th, 50th, 75th percentile.
quantiles
recency | frequency | monetary_value | |
---|---|---|---|
0.25 | 30.0 | 5.0 | 1145.0 |
0.50 | 75.0 | 6.0 | 2257.0 |
0.75 | 183.0 | 8.0 | 3784.0 |
Send quantiles to a dictionary, easier to use.
quantiles = quantiles.to_dict()
quantiles
{'frequency': {0.25: 5.0, 0.5: 6.0, 0.75: 8.0},
'monetary_value': {0.25: 1145.0, 0.5: 2257.0, 0.75: 3784.0},
'recency': {0.25: 30.0, 0.5: 75.0, 0.75: 183.0}}
Creating the RFM segmentation table
rfmSegmentation = rfmTable
We create two classes for the RFM segmentation since, being high recency is bad, while high frequency and monetary value is good.
# Arguments (x = value, p = recency, monetary_value, frequency, k = quartiles dict)
def RClass(x,p,d):
if x <= d[p][0.25]:
return 1
elif x <= d[p][0.50]:
return 2
elif x <= d[p][0.75]:
return 3
else:
return 4
# Arguments (x = value, p = recency, monetary_value, frequency, k = quartiles dict)
def FMClass(x,p,d):
if x <= d[p][0.25]:
return 4
elif x <= d[p][0.50]:
return 3
elif x <= d[p][0.75]:
return 2
else:
return 1
rfmSegmentation['R_Quartile'] = rfmSegmentation['recency'].apply(RClass, args=('recency',quantiles,))
rfmSegmentation['F_Quartile'] = rfmSegmentation['frequency'].apply(FMClass, args=('frequency',quantiles,))
rfmSegmentation['M_Quartile'] = rfmSegmentation['monetary_value'].apply(FMClass, args=('monetary_value',quantiles,))
rfmSegmentation['RFMClass'] = rfmSegmentation.R_Quartile.map(str) \
+ rfmSegmentation.F_Quartile.map(str) \
+ rfmSegmentation.M_Quartile.map(str)
rfmSegmentation.head()
recency | frequency | monetary_value | R_Quartile | F_Quartile | M_Quartile | RFMClass | |
---|---|---|---|---|---|---|---|
customer | |||||||
Aaron Bergman | 415 | 3 | 887 | 4 | 4 | 4 | 444 |
Aaron Hawkins | 12 | 7 | 1744 | 1 | 2 | 3 | 123 |
Aaron Smayling | 88 | 7 | 3050 | 3 | 2 | 2 | 322 |
Adam Bellavance | 54 | 8 | 7756 | 2 | 2 | 1 | 221 |
Adam Hart | 34 | 10 | 3249 | 2 | 1 | 2 | 212 |
# Uncomment any of the following lines to: copy data to clipboard or save it to a CSV file.
# rfmSegmentation.to_clipboard()
# rfmSegmentation.to_csv('rfm-table.csv', sep=',')
Who are the top 5 best customers? by RFM Class (111), high spenders who buy recently and frequently?
rfmSegmentation[rfmSegmentation['RFMClass']=='111'].sort_values('monetary_value', ascending=False).head(5)
recency | frequency | monetary_value | R_Quartile | F_Quartile | M_Quartile | RFMClass | |
---|---|---|---|---|---|---|---|
customer | |||||||
Sanjit Engle | 9 | 11 | 12210 | 1 | 1 | 1 | 111 |
John Lee | 21 | 11 | 9801 | 1 | 1 | 1 | 111 |
Pete Kriz | 9 | 12 | 8647 | 1 | 1 | 1 | 111 |
Harry Marie | 2 | 10 | 8237 | 1 | 1 | 1 | 111 |
Lena Creighton | 16 | 12 | 7661 | 1 | 1 | 1 | 111 |
## Question, can the RFM class be automated using KNN - What clusters does it find.